Inferring transcriptional regulators from expression data and ChIP-seq databases with xcore

Maciej Migdał\(^1\), Takahiro Arakawa\(^{2}\), Satoshi Takizawa\(^{2}\), Masaaki Furuno\(^{2}\), Harukazu Suzuki\(^{2}\), Erik Arner\(^{2,3}\), Cecilia Lanny Winata\(^{1†}\), Bogumił Kaczkowski\(^{2†}\)

\(^1\) International Institute of Molecular and Cell Biology in Warsaw, Laboratory of Zebrafish Developmental Genomics, Warsaw, Poland

\(^2\) RIKEN Center for Integrative Medical Sciences, Yokohama, Japan

\(^3\) GSK, Stevenage, United Kingdom

Introduction

The xcore and xcoredata R packages are open-source and freely available on GitHub. The xcore user guide is available at bkaczkowski.github.io/xcore.
Both packages can be installed from Bioconductor:
BiocManager::install("xcore")
BiocManager::install("xcoredata")

Identifying the transcription factors (TFs) that drive changes in gene expression is one of the most common goals of expression studies. Existing methods rely on predicted transcription factor binding sites (TFBSs) to model changes in motif activity.

Given the wealth of ChIP-seq data available for a wide range of TFs in various cell types, we propose that gene expression modelling can be done using ChIP-seq “signatures” directly. We present xcore, an R package that allows TF activity modelling based on ChIP-seq signatures and the user’s gene expression data.

xcore and xcoredata packages

The xcore package provides a framework for modelling transcription factor (TF) activity based on TF molecular signatures and the user’s gene expression data.

The xcoredata package provides a collection of pre-processed TF molecular signatures constructed from ChIP-seq experiments available in the ReMap2020 and ChIP-Atlas databases.

xcore recovers key TFs involved in TGFβ-induced EMT

The standard xcore workflow starts with a matrix of gene expression counts and a design matrix describing the experimental design. The user also needs to select a reference time point or condition:
mae <- prepareCountsForRegression(counts = counts, design = design, base_lvl = "00hr")
In the next step, the user combines the input gene expression data with pre-processed molecular signatures:
mae <- addSignatures(mae, remap = remap_signatures)
Finally, gene expression modelling is performed:
res <- modelGeneExpression(mae = mae, xnames = "remap")
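
The `counts` and `design` inputs used above are not shown here, so the following is only a minimal base-R sketch of what they might look like. It assumes that the design matrix has one row per sample and one 0/1 column per group, and that the column name of the reference group matches `base_lvl`. The sample names and the simulated counts are hypothetical and serve illustration only.

```r
# Hypothetical sample names: three replicates at 0 h and three at 24 h.
samples <- c("00hr_rep1", "00hr_rep2", "00hr_rep3",
             "24hr_rep1", "24hr_rep2", "24hr_rep3")
groups <- c("00hr", "24hr")

# Assumed layout: one row per sample, one column per group;
# a 1 marks the group a sample belongs to.
design <- matrix(0, nrow = length(samples), ncol = length(groups),
                 dimnames = list(samples, groups))
for (g in groups) {
  design[startsWith(samples, g), g] <- 1
}

# Toy count matrix (genes x samples); values are simulated, not real data.
set.seed(1)
counts <- matrix(rpois(5 * length(samples), lambda = 50),
                 nrow = 5,
                 dimnames = list(paste0("gene", 1:5), samples))

# Sanity checks: sample order agrees between the two inputs,
# and every sample is assigned to exactly one group.
stopifnot(identical(colnames(counts), rownames(design)))
stopifnot(all(rowSums(design) == 1))
```

With inputs shaped like this, the reference level passed as `base_lvl` would be the `"00hr"` column of `design`.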

We used xcore to model gene expression in the context of TGFβ-induced epithelial-mesenchymal transition (EMT). To this end, we used two expression datasets: a novel CAGE dataset from an experiment performed in the A-549 and MDA-231-D cell lines, and a publicly available microarray dataset (GSE17708).

Among the top-scoring ChIP-seq signatures, we identified known key TFs involved in the TGFβ pathway, such as SMAD2, SMAD3 and SMAD4.

ChIP-seq molecular signatures outperform motif-based signatures

Using the 24 h vs 0 h comparison from our CAGE TGFβ-induced EMT dataset, we compared models built using ChIP-seq signatures (ReMap2020 and ChIP-Atlas) with models built using motif-based signatures (JASPAR and SwissRegulon). Models based on ChIP-seq signatures showed, on average, higher \(R^2\) values than models based on motifs. Although the obtained \(R^2\) values were generally low, a randomised version of the ReMap2020 molecular signature yielded \(R^2\) close to 0, suggesting that the variance explained by the real signatures is not due to chance.
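
The logic of the randomised control can be illustrated with a small base-R sketch on simulated data. This is not the xcore procedure: it uses ordinary least squares via `lm()` and in-sample \(R^2\), and it assumes a binary gene-by-TF signature matrix. All sizes, effect values and names below are hypothetical.

```r
set.seed(42)
n_genes <- 500
n_tfs <- 10

# Simulated binary signature matrix (genes x TFs): 1 = gene carries the TF signature.
signatures <- matrix(rbinom(n_genes * n_tfs, size = 1, prob = 0.2),
                     nrow = n_genes,
                     dimnames = list(NULL, paste0("TF", seq_len(n_tfs))))

# Simulated expression changes driven by two TFs plus noise.
activity <- c(1, -0.8, rep(0, n_tfs - 2))
logfc <- as.vector(signatures %*% activity) + rnorm(n_genes, sd = 1)

# Randomised control: permute the rows so that the gene-to-signature
# assignment is broken while the overall signature frequencies are kept.
shuffled <- signatures[sample(n_genes), , drop = FALSE]

# In-sample R^2 of a linear model of expression change on the signatures.
r_squared <- function(x, y) summary(lm(y ~ x))$r.squared

r2_real <- r_squared(signatures, logfc)
r2_random <- r_squared(shuffled, logfc)
```

In this toy setting `r2_real` is modest because most of the variance is noise, yet it stays clearly above `r2_random`, which mirrors the comparison described above.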